14 research outputs found

    Pretenuring for Java

    Pretenuring is a technique for reducing copying costs in garbage collectors. When pretenuring, the allocator places long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on profile-driven pretenuring as follows. (1) We develop a collector-neutral approach to obtaining object lifetime profile information. We show that our collection of Java programs exhibits a very high degree of homogeneity of object lifetimes at each allocation site. This result is robust with respect to different inputs, is similar to previous results for ML, and contrasts with C programs, which require dynamic call-chain context information to extract homogeneous lifetimes. Call-site homogeneity considerably simplifies the implementation of pretenuring and makes it more efficient. (2) Our pretenuring advice is neutral with respect to the collector algorithm, and we use it to improve two quite different garbage collectors: a traditional generational collector and an older-first collector. The system is also novel in that it classifies and allocates objects into three categories: we allocate immortal objects into a permanent region that the collector never considers, long-lived objects into a region into which the collector places survivors of the most recent collection, and short-lived objects into the nursery, i.e., the default region. (3) We evaluate pretenuring on Java programs. Our simulation results show that pretenuring significantly reduces collector copying for generational and older-first collectors.
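The three-way, per-allocation-site classification the abstract describes can be sketched as follows. This is a minimal illustration under our own assumptions: the class, method names, and survival-rate thresholds are hypothetical, not taken from the paper.

```java
// Hedged sketch: mapping profiled allocation sites to one of three
// allocation spaces. Thresholds (0.9, 0.4) are illustrative only.
import java.util.HashMap;
import java.util.Map;

enum Space { IMMORTAL, LONG_LIVED, NURSERY }

public class PretenuringAdvice {
    private final Map<String, Space> advice = new HashMap<>();

    // site: e.g. "Foo.java:42"; survivalRate: fraction of bytes allocated
    // at this site that survive past a lifetime threshold in the profile.
    public void classify(String site, double survivalRate) {
        if (survivalRate > 0.9) {
            advice.put(site, Space.IMMORTAL);   // region the collector never considers
        } else if (survivalRate > 0.4) {
            advice.put(site, Space.LONG_LIVED); // region holding recent survivors
        } else {
            advice.put(site, Space.NURSERY);    // default, frequently collected region
        }
    }

    // Unprofiled sites fall back to the default nursery allocation.
    public Space adviceFor(String site) {
        return advice.getOrDefault(site, Space.NURSERY);
    }
}
```

Because the advice is keyed only by allocation site (no call-chain context), the allocator can consult it with a single table lookup at each allocation, which is the efficiency benefit the abstract attributes to call-site homogeneity.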

    Loop Fusion for Data Locality and Parallelism

    Modern processors use a memory hierarchy with several levels, and achieving high performance demands effective use of cache locality. Compiler transformations can relieve the programmer from hand-optimizing for a specific machine architecture. Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can increase data locality, thereby improving cache behavior; it can also increase the granularity of parallel loops, thereby decreasing barrier-synchronization overhead and improving program performance. However, very-large-granularity loops are undesirable if they introduce register spills inside the loop. Previous approaches to the fusion problem have considered these factors in isolation. In this work, we present a new model that considers data locality and parallelism together, subject to register pressure. We build a weighted directed acyclic graph, called the fusion graph, in which the nodes represent loops and the weights on edg..

    A Parametrized Loop Fusion Algorithm for Improving Parallelism and Cache Locality

    Loop fusion is a reordering transformation that merges multiple loops into a single loop. It can increase data locality and the granularity of parallel loops, thus improving program performance. Previous approaches to this problem have looked at these two benefits in isolation. In this work, we propose a new model that considers data locality, parallelism, and register pressure together. We build a weighted directed acyclic graph in which the nodes represent program loops along with their register pressure, and the edges represent the amount of locality and parallelism present. The direction of an edge represents an execution-order constraint. We then partition the graph into components such that the sum of the weights on the cut edges is minimized, subject to the constraints that the nodes in the same partition can be safely fused together and that the register pressure of the combined loop does not exceed the number of available registers. Previous work demonstrates that the general problem of finding optimal partitions is NP-hard. In restricted cases, we show that it is possible to arrive at the optimal solution. We give an algorithm for the restricted case and a heuristic for the general case. We demonstrate the effectiveness of fusion and of our approach with experimental results.
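The register-pressure constraint on fused partitions can be illustrated with a toy sketch. This is our own simplification for a straight-line chain of loops, not the paper's graph-partitioning algorithm: it greedily fuses loops in execution order and starts a new partition whenever the combined register pressure would exceed the register budget. All names and the register count are assumptions.

```java
// Illustrative sketch only: greedy chain fusion under a register budget.
// The paper's general problem (min-cut partitioning of the fusion graph)
// is NP-hard; this handles just an ordered chain of fusible loops.
import java.util.ArrayList;
import java.util.List;

class Loop {
    final String name;
    final int registerPressure; // estimated registers live in the loop body
    Loop(String name, int registerPressure) {
        this.name = name;
        this.registerPressure = registerPressure;
    }
}

public class FusionGraph {
    static final int AVAILABLE_REGISTERS = 16; // assumed machine budget

    static List<List<Loop>> partition(List<Loop> loopsInOrder) {
        List<List<Loop>> partitions = new ArrayList<>();
        List<Loop> current = new ArrayList<>();
        int pressure = 0;
        for (Loop l : loopsInOrder) {
            // Close the current partition if fusing this loop would spill.
            if (!current.isEmpty()
                    && pressure + l.registerPressure > AVAILABLE_REGISTERS) {
                partitions.add(current);
                current = new ArrayList<>();
                pressure = 0;
            }
            current.add(l);
            pressure += l.registerPressure;
        }
        if (!current.isEmpty()) partitions.add(current);
        return partitions;
    }
}
```

For example, loops with pressures 10, 5, and 8 against a 16-register budget fuse into two partitions: the first two loops together, the third alone.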

    A Parallel Implementation of a Correspondence-Finder for Uncalibrated Stereo Image Pairs

    We report on our experience parallelizing a computer vision algorithm. The algorithm employs low-level image-processing techniques, which are relatively easy to parallelize, and intermediate-level computer vision techniques, which lack the regularity and locality of image-processing algorithms. The application is an excellent candidate for use as a benchmark. We implement two parallel versions of this algorithm, the second based on our experience with the first. We program both in the Single Program, Multiple Data (SPMD) model using the MPI message-passing interface and evaluate them on a four-node IBM SP. Our results show excellent speedups for the image-processing portion and good speedups for most of the application; part of the application, however, is inherently sequential. Our second parallel implementation is not only more efficient than the first but also achieves better speedups. In addition, we suggest changes to ..
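The inherently sequential part of the application bounds the achievable speedup by Amdahl's law, which is worth quantifying for a four-node machine like the one used here. The parallel fraction below is an illustrative assumption, not a figure from the paper.

```java
// Amdahl's law: speedup = 1 / ((1 - f) + f / p), where f is the
// parallelizable fraction of the work and p the number of processors.
public class Amdahl {
    static double speedup(double f, int p) {
        return 1.0 / ((1.0 - f) + f / p);
    }

    public static void main(String[] args) {
        // Assuming 90% of the work parallelizes on a 4-node SP:
        System.out.printf("%.2f%n", speedup(0.9, 4)); // ~3.08
    }
}
```

Even with 90% of the work parallelized, four nodes yield at most about a 3.1x speedup, which is consistent with reporting "good" rather than linear speedups for the application as a whole.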

    Pretenuring For Java

    Pretenuring can reduce copying costs in garbage collectors by allocating long-lived objects into regions that the garbage collector will rarely, if ever, collect. We extend previous work on pretenuring as follows. (1) We produce pretenuring advice that is neutral with respect to the garbage collector algorithm and configuration, so we can and do combine advice from different applications. We find that predictions based on object lifetimes at each allocation site in Java programs are accurate, which simplifies the pretenuring implementation. (2) We gather and apply advice to applications and to the Jalapeño JVM, a compiler and run-time system for Java written in Java. Our results demonstrate that building combined advice from different application executions into Jalapeño improves performance regardless of the application Jalapeño is compiling and executing. This build-time advice thus gives user applications some of the benefits of pretenuring without any application profiling. No previous work pretenures in the run-time system. (3) We find that application-only advice also improves performance, but the combination of build-time and application-specific advice is almost always noticeably better. (4) The same advice improves the performance of generational and Older First collection, illustrating that it is collector neutral.
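Because the advice is collector-neutral, per-site statistics from several profiled runs can simply be accumulated before thresholding. A minimal sketch of such combination, with field and method names that are our own assumptions, might look like:

```java
// Hedged sketch: merging per-site lifetime statistics from multiple
// application profiles into one combined survival rate. Summing raw
// byte counts lets larger runs weigh proportionally more.
import java.util.HashMap;
import java.util.Map;

public class CombinedAdvice {
    // Per allocation site: [0] total bytes allocated, [1] bytes surviving.
    private final Map<String, long[]> stats = new HashMap<>();

    public void addProfile(String site, long bytesAllocated, long bytesSurvived) {
        long[] s = stats.computeIfAbsent(site, k -> new long[2]);
        s[0] += bytesAllocated;
        s[1] += bytesSurvived;
    }

    // Combined survival rate for one site across all added profiles;
    // unseen sites default to 0.0 (i.e., nursery allocation).
    public double survivalRate(String site) {
        long[] s = stats.get(site);
        return (s == null || s[0] == 0) ? 0.0 : (double) s[1] / s[0];
    }
}
```

Advice built this way from several benchmarks could then be compiled into the run-time system itself, which is the build-time scenario the abstract evaluates for Jalapeño.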